{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font size=6>机器学习的最佳实践</font>\n",
    "# 机器学习的最佳实践\n",
    "最佳实践，是一个管理学概念，认为存在某种技术、方法、过程、活动或机制可以使生产或管理实践的结果达到最优，并减少出错的可能性。\n",
    "\n",
    "1. 正确规划项目构建流程；\n",
    "2. 迭代过程中遇到问题的解决方法；\n",
    "<table>\n",
    "    <tr>\n",
    "        <td><img src=\"images/b87dfe87110ca94eca2aa3b171d71d56.png\"></td>\n",
    "        <td><img src=\"images/161d983f9cdcc586a5b79b3161721d6c.png\"></td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td><img src=\"images/17f0bfca6e220f338554fb1dec7d9add.png\"></td>\n",
    "        <td><img src=\"images/62e716e53443c04d65d63ccc6ceba9e8.png\"></td>\n",
    "    </tr>\n",
    "<table>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 正交化\n",
    "正交化的概念就是指，将你可以调整的参数设置在不同的正交(垂直)的维度上，调整其中一个参数，不会或几乎不会影响其他维度上的参数变化，这样在机器学习项目中，可以让你更容易更快速地将参数调整到一个比较好的数值。\n",
    "## 正交化场景\n",
    "<img src=\"images/7f5034517a97e2091be6f5095fdcb978.png\">\n",
    "\n",
    "## 正交化ML\n",
    "<img src=\"images/e43a775bf7328d50e9efd02f5ea8b43d.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 单一评估指标\n",
    "无论是调整超参数，或者是尝试不同的学习算法，或者在搭建机器学习系统时尝试不同手段，如果存在单一的评估指标，那么开发进展会快得多。这是因为，单一评估指标可以快速告诉我们，新尝试的手段比之前的手段好还是差。\n",
    "\n",
    "## 例1\n",
    "|分类|查准率|召回率|\n",
    "|-|-|-|\n",
    "|A|95%|90%|\n",
    "|B|98%|85%|\n",
    "\n",
    "**混淆矩阵：**\n",
    "<table width=80%>\n",
    "    <tr>\n",
    "        <th></th><th>P:正-Positive</th><th>N:负-Negative</th>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td>T:真-True</td><td>TP:真正,将正类预测为正类数</td><td>TN:真负,将负类预测为负类数</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td>F:假-False</td>\n",
    "        <td>FP:假正,将负类预测为正类数误报 (Type I error)</td>\n",
    "        <td>FN:假负,将正类预测为负类数漏报 (Type II error)</td>\n",
    "    </tr>\n",
    "</table>\n",
    "\n",
    "**评估指标：**\n",
    "<table width=80%>\n",
    "    <tr>\n",
    "        <th  width=15%>评估指标</th><th>解释</th><th>公式</th>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td>准确率</td><td>准确率是我们最常见的评价指标，而且很容易理解，就是被分对的样本数除以所有的样本数，通常来说，正确率越高，分类器越好。</td><td>$Accuracy=\\frac{T}{T+F}$</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td>错误率</td><td>错误率则与准确率相反，描述被分类器错分的比例。</td><td>$Error\\_rate=\\frac{F}{T+F}=1-Accuracy$</td>\n",
    "    </tr>   <tr>\n",
    "        <td>特效度</td><td>表示的是所有负例中被分对的比例，衡量了分类器对负例的识别能力。</td><td>$sensitive=\\frac{TN}{N}$</td>\n",
    "    </tr>    <tr>\n",
    "        <td>查准率</td><td>表示被分为正例的示例中实际为正例的比例。</td><td>$precision=\\frac{TP}{P}$</td>\n",
    "    </tr>    <tr>\n",
    "        <td>召回率</td><td>召回率是覆盖面的度量，度量有多个正例被分为正例</td><td>$recall=\\frac{TP}{TP+FN}$</td>\n",
    "    </tr>\n",
    "</table>\n",
    "\n",
    "**$F1$值:**\n",
    "\n",
    "<img src=\"images/575144df22e455c4269043b1bb3d2d5c.png\" width=60%>\n",
    "\n",
    "**单一评估:**\n",
    "\n",
    "|分类|查准率|召回率|$F1$值|\n",
    "|-|-|-|-|\n",
    "|A|95%|90%|92.4%|\n",
    "|B|98%|85%|91.0%|\n",
    "\n",
    "## 例2\n",
    "求平均值，看平均表现：\n",
    "<img src=\"images/d16788ba017b04733720dfe7969ef838.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 满足和优化指标-另外一种多指标评估方法\n",
    "1. 优化指标作为评估指标，提供模型设定成本值得依据；\n",
    "2. 满足指标作为过滤器；\n",
    "3. 先过滤，在评估；\n",
    "<img src=\"images/f4574276832597fbc9d5977401896d15.png\">"
   ]
  },
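  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The filter-then-evaluate procedure can be sketched as follows. Everything concrete here is an assumption made for illustration: accuracy as the optimizing metric, running time as the satisficing metric, the 100 ms threshold, the candidate numbers, and the helper name `select_model`.\n",
    "\n",
    "```python\n",
    "# Hypothetical candidates: accuracy is the optimizing metric,\n",
    "# running time in milliseconds is the satisficing metric.\n",
    "candidates = [\n",
    "    {'name': 'A', 'accuracy': 0.90, 'runtime_ms': 80},\n",
    "    {'name': 'B', 'accuracy': 0.92, 'runtime_ms': 95},\n",
    "    {'name': 'C', 'accuracy': 0.95, 'runtime_ms': 1500},\n",
    "]\n",
    "MAX_RUNTIME_MS = 100  # assumed satisficing threshold\n",
    "\n",
    "def select_model(models, max_runtime_ms):\n",
    "    # Step 1: filter with the satisficing metric.\n",
    "    feasible = [m for m in models if m['runtime_ms'] <= max_runtime_ms]\n",
    "    if not feasible:\n",
    "        return None\n",
    "    # Step 2: rank the survivors by the optimizing metric.\n",
    "    return max(feasible, key=lambda m: m['accuracy'])\n",
    "\n",
    "chosen = select_model(candidates, MAX_RUNTIME_MS)\n",
    "print(chosen['name'])\n",
    "```\n",
    "\n",
    "Candidate C has the highest accuracy but fails the filter, so the choice falls to the most accurate model among those that pass."
   ]
  },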
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 训练/开发/测试集\n",
    "## $train,dev,test$\n",
    "<img src=\"images/163d77ba74a55f627d201cd2d9ae8f07.png\" width=60%>\n",
    "\n",
    "## 原则\n",
    "1. 保证来自于同一分布空间，即子集与全集是同一分布，方法：从全集中随机抽取；\n",
    "2. 开发和测试集的划分，要考虑项目的应用场景，做到不违反第一原则的情况下，有所倾斜；"
   ]
  },
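  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Principle 1 amounts to shuffling before splitting. The sketch below uses only the standard library; the function name `split_dataset`, the 60/20/20 proportions, and the fixed seed are assumptions for illustration, not values prescribed by these notes.\n",
    "\n",
    "```python\n",
    "import random\n",
    "\n",
    "def split_dataset(samples, dev_frac=0.2, test_frac=0.2, seed=0):\n",
    "    # Shuffle a copy so every subset is a random draw from the full set.\n",
    "    shuffled = list(samples)\n",
    "    random.Random(seed).shuffle(shuffled)\n",
    "    n_dev = int(len(shuffled) * dev_frac)\n",
    "    n_test = int(len(shuffled) * test_frac)\n",
    "    dev = shuffled[:n_dev]\n",
    "    test = shuffled[n_dev:n_dev + n_test]\n",
    "    train = shuffled[n_dev + n_test:]\n",
    "    return train, dev, test\n",
    "\n",
    "train, dev, test = split_dataset(range(100))\n",
    "print(len(train), len(dev), len(test))\n",
    "```\n",
    "\n",
    "Because dev and test are cut from the same shuffled list, they share one distribution; weighting toward the application scenario (principle 2) would be applied when assembling the full set, before this split."
   ]
  },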
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 资源、效率和基准\n",
    "1. 资源（时间）作为输入，效率（准确率）作为产出，基准代表合格的标准；\n",
    "2. 以人工的准确率作为基准；\n",
    "3. 在持续投入下，机器同人工的表现；\n",
    "    1. 加速追赶阶段；\n",
    "    2. 到达临界点，得到认可；\n",
    "    2. 减速超越阶段；\n",
    "4. 机器效率曲线在临界点前后的不同表现的原因；\n",
    "    1. 存在理论最优值，投入产出比边际递减；\n",
    "    3. 人类的表现是经过较长时间进化而来，在绝对值上是被认可的；\n",
    "    4. 在超越之前，可以及固化人工的方法提高机器表现，超越之后没有这么易得的知识；\n",
    "    \n",
    "<img src=\"images/42053491e39c1b60da8755c035cd4d95.png\" width=90%>\n",
    "\n",
    "## 可避免偏差（Avoidable bias）\n",
    "1. 偏差是相对于零误差而言；\n",
    "2. 可避免偏差是相对人工误差而言；\n",
    "3. 方差是相对于训练误差而言。\n",
    "\n",
    "### 例\n",
    "<table>\n",
    "    <tr>\n",
    "        <th>指标</th><th>基准误差</th><th>训练误差</th><th>开发/验证误差</th><th>偏差</th><th>方差</th><th>优化点</th>\n",
    "    </tr><tr>\n",
    "        <td>例1</td><td>%0</td><td>8%</td><td>10%</td><td>8%</td><td>2%</td><td>优化偏差</td>\n",
    "    </tr><tr>\n",
    "        <td>例2</td><td>%4</td><td>8%</td><td>10%</td><td>4%</td><td>2%</td><td>优化偏差</td>\n",
    "    </tr><tr>\n",
    "        <td>例3</td><td>%7.5</td><td>8%</td><td>10%</td><td>0.5%</td><td>2%</td><td>优化方差</td>\n",
    "    </tr>\n",
    "    \n",
    "</table>\n",
    "\n",
    "**总结：**\n",
    "\n",
    "1. 通过偏差和误差的相对大小对问题的定位；\n",
    "2. 偏差和误差的相对大小，受基准误差设定的影响；\n",
    "3. 先设定人工误差为基准误差，随着模型不断迭代优化，可以提高基准门槛；\n",
    "4. 模型开发是一个螺旋上升的迭代过程；"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 误差分析\n",
    "某类错误率较大，应该针对性的优化，否则，可以允许这种误差存在\n",
    "##  建立错误分类表格，归纳错误类型\n",
    "1. 分类错误类型：\n",
    "    1. 假阳性-误报；\n",
    "    1. 假阴性-漏报；\n",
    "2. 原因错误类型：\n",
    "    1. 图片模糊；\n",
    "    2. 声音存在噪音；\n",
    "    3. 错误标记；\n",
    "    4. 数据异常；\n",
    "<table>\n",
    "    <tr><th>错误样本编号</th><th>$Error_1$</th><th>...</th><th>$Error_n$</th></tr>\n",
    "    <tr><td>1</td><td>1</td><td></td><td></td></tr>\n",
    "    <tr><td>$\\vdots$</td><td></td><td></td><td></td></tr>\n",
    "    <tr><td>n</td><td></td><td></td><td>1</td></tr>\n",
    "    <tr><td>$\\frac{sum}{n}$</td><td>a%</td><td>b%</td><td>c%</td></tr>\n",
    "</table>\n",
    "\n",
    "## 例\n",
    "<img src=\"images/e5c7f1005d695914f4a2fc988aa46821.png\">\n",
    "\n",
    "## 误差分析是一个人工反馈的好办法\n",
    "1. 能够帮助找到明确的待优化问题；\n",
    "2. 加深对待解决问题和数据集的了解；\n",
    "3. 手工操作在开发阶段必不可少，开发阶段更多的人工操作是为了应用阶段更少的人工干预；"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 快速搭建原型，进入迭代状态\n",
    "## 原因\n",
    "1. 机器学习模型可选择的功能模块很多，各模块可选择的方法很多；\n",
    "2. 原则：$idea\\to code\\to experiment\\to idea$\n",
    "\n",
    "## 步骤\n",
    "1. 简单的实现，设定单一评估指标，完成训练和测试；\n",
    "2. 根据偏差和方差，选择优化方向；\n",
    "3. 改进之后，从2开始迭代；\n",
    "4. 提高基准门槛，从2开始迭代；"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
